This folder contains all the code written to extract Census Bureau data, build an LSTM model with k-fold cross validation, predict house values, and map the data and predictions using GeoJSON data from the TIGERWeb API.



Census_Data_Extraction.ipynb

This code accesses the Census dataset and creates CSV files with median house value, median household income, and racial demographic data for all census tracts in Mercer County, New Jersey.

U.S. Census Bureau geographic boundaries of census tracts are revised every 10 years, with tracts often being split in two to better reflect drastic changes in population or income distribution in the area. Consequently, this code roughly estimates the values of characteristics within the newly defined boundaries from the data for the old census tracts.

The script contains four functions.

- fetch_census_data_by_field: Fetches census data given the FIPS codes for the region and the Census code for the field.
- update_median_census_tracts: Fills in the new census tracts with the values from the old census tract for the period provided. This function is only intended for median characteristics.
- update_sum_census_tracts: Fills in the new census tracts with the values from the old census tract, applying ratios derived from the year 2020 data and the old census tract data. This function is only intended for sum characteristics.
- get_row: Fetches the row of data for the specified census tract ID.
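
The two update functions are only described briefly above, so here is a minimal sketch of one way such back-filling could work. It is not the notebook's actual implementation. It assumes a wide DataFrame indexed by tract ID with one integer column per year. The tract IDs, the names `backfill_median` and `backfill_sum`, and the choice of each new tract's share of the 2020 total as the ratio are all assumptions made for illustration.

```python
import pandas as pd

def backfill_median(df, old_tract, new_tracts, years):
    """Copy the old tract's median values into each new tract for the given years."""
    out = df.copy()
    for tract in new_tracts:
        out.loc[tract, years] = out.loc[old_tract, years].values
    return out

def backfill_sum(df, old_tract, new_tracts, years, ref_year=2020):
    """Split the old tract's totals among the new tracts by their ref_year shares."""
    out = df.copy()
    # Assumed ratio: each new tract's share of the combined ref_year total.
    shares = out.loc[new_tracts, ref_year] / out.loc[new_tracts, ref_year].sum()
    for tract in new_tracts:
        out.loc[tract, years] = (out.loc[old_tract, years] * shares[tract]).values
    return out

# Hypothetical tract "A" was split into "A1" and "A2" in 2020.
df = pd.DataFrame(
    {2018: [100.0, None, None], 2019: [110.0, None, None], 2020: [None, 90.0, 30.0]},
    index=["A", "A1", "A2"],
)
medians = backfill_median(df, "A", ["A1", "A2"], [2018, 2019])
sums = backfill_sum(df, "A", ["A1", "A2"], [2018, 2019])
```

A median cannot be meaningfully divided between two areas, so it is copied as-is, while a count (such as population) is divided so that the new tracts still add up to the old total.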

The script is easily adaptable to other features. To generate DataFrames for other characteristics, change the State FIPS and County FIPS to the location of interest, and change the field and field name to the ACS identifier and the name of the characteristic.

The old census tracts and their corresponding new census tracts must also be changed. I could not find a simple way to determine which census tracts were divided in 2020. My method was to open the map of census tracts for the county in 2020 (found here: https://www.census.gov/geographies/reference-maps/2020/geo/2020pl-maps/2020-census-tract.html) and compare it with the map of census tracts for the county in 2010 (found here: https://www.census.gov/geographies/reference-maps/2010/geo/2010-census-tract-maps.html).

As an additional piece of advice for finding the new/old census tracts, I recommend looking at the dataset over time before using the update functions. Rows that contain values only before 2020 belong to tracts that were removed in the 2020 revision. Rows that contain values only for 2020 and later belong to tracts that were created by it.
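
The check described above can be done programmatically. The sketch below is one possible way to do it, assuming a wide DataFrame indexed by tract ID with integer year columns. The function name `classify_tracts` and the sample tract labels are hypothetical.

```python
import pandas as pd

def classify_tracts(df, revision_year=2020):
    """Return (removed, created) tract IDs based on which years have data."""
    before = [c for c in df.columns if c < revision_year]
    after = [c for c in df.columns if c >= revision_year]
    has_before = df[before].notna().any(axis=1)
    has_after = df[after].notna().any(axis=1)
    removed = df.index[has_before & ~has_after].tolist()   # data only before the revision
    created = df.index[~has_before & has_after].tolist()   # data only from the revision on
    return removed, created

df = pd.DataFrame(
    {2018: [100.0, None, 50.0], 2019: [110.0, None, 55.0],
     2020: [None, 90.0, 60.0], 2021: [None, 95.0, 62.0]},
    index=["old_tract", "new_tract", "unchanged_tract"],
)
removed, created = classify_tracts(df)
print("Removed in revision:", removed)
print("Created in revision:", created)
```

This only narrows down the candidates. Matching each removed tract to the tracts that replaced it still requires comparing the maps.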



House_Value_Prediction.ipynb

This code utilizes the CSV dataset files containing data from 2012 to 2022 on median house value, median household income, and racial demographics for all census tracts in Mercer County, New Jersey.

This data is used to predict future house value. LSTM models are trained with various combinations of features and sequence lengths in order to determine the most effective one. K-fold cross validation is used to obtain more reliable error scores.

The script contains three functions.

- reshape_wide_to_long: Melts the DataFrame to long format and adds a 'Year' column for simpler analysis.
- fill_missing_values: Fills all NA values in the DataFrame using linear interpolation, then forward fill, then backward fill.
- scale_by_census_tract: Scales a feature of the DataFrame to the range 0 to 1 using MinMaxScaler. The scalers are stored so that predictions can be converted back to the original scale.
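
As an illustration of the fill order and the reversible scaling described above, here is a minimal sketch on a single tract's series. It is not the notebook's code: the helper names are hypothetical, the values are made up, and plain arithmetic stands in for scikit-learn's MinMaxScaler so that the stored bounds are visible.

```python
import pandas as pd

def fill_gaps(series):
    """Linear interpolation first, then forward fill, then backward fill."""
    return series.interpolate(method="linear").ffill().bfill()

def min_max_scale(series):
    """Scale to [0, 1] and return the (min, max) needed to undo the scaling."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo), (lo, hi)  # assumes the series is not constant

def inverse_scale(scaled, bounds):
    """Map scaled values back to the original units."""
    lo, hi = bounds
    return scaled * (hi - lo) + lo

# Made-up values for one tract, with gaps at both ends and in the middle.
values = pd.Series([None, 200.0, None, 300.0, None],
                   index=[2012, 2013, 2014, 2015, 2016])
filled = fill_gaps(values)
scaled, bounds = min_max_scale(filled)
restored = inverse_scale(scaled, bounds)
```

Interpolation handles the gap between known values, and the forward and backward fills cover anything left at the edges. Keeping the bounds per tract is what allows a prediction made on the 0 to 1 scale to be turned back into dollars.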

The feature combinations tested in the script in order to predict median house value are:

- median house value
- median house value + median household income
- median house value + white population percentage
- median house value + white population percentage + median household income

Sequence lengths of 3, 4, 5, and 8 were tested on each of the combinations. The sequence length is the number of time steps (data points) in each input sequence, which acts like a sliding window. For example, a sequence length of 5 would use years 1-5 as input, then years 2-6, then years 3-7, and so on, shifting one time step forward each time.
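
The sliding window can be sketched in a few lines. This is an illustration rather than the notebook's code. The function name `make_sequences` is hypothetical, and pairing each window with the value that immediately follows it as the prediction target is an assumption.

```python
def make_sequences(values, seq_len):
    """Pair each window of seq_len values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(values) - seq_len):
        inputs.append(values[start:start + seq_len])
        targets.append(values[start + seq_len])  # assumed target: the next time step
    return inputs, targets

years = list(range(2012, 2023))  # 2012 through 2022, 11 annual observations
inputs, targets = make_sequences(years, seq_len=5)

for window, target in zip(inputs, targets):
    print(window, "->", target)
```

With 11 annual observations, a sequence length of 5 yields only 6 training windows per tract and a sequence length of 8 yields 3, which is worth keeping in mind when comparing error scores across sequence lengths.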


5-fold cross validation is used to obtain a more reliable average of the mean squared error (MSE) and mean absolute error (MAE) scores. MSE is more sensitive to outliers because the errors are squared. As such, a higher MSE suggests that some predicted values may be very different from their actual values. Whether lower MSE (fewer large misses) or lower MAE (closer on average) carries more weight is a matter of preference, but the model I ultimately chose for predictions was sequence length 4 with median house value + median household income, because both its MSE and its MAE were lower than those of the other combinations.
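
To make the difference between the two metrics concrete, the sketch below compares two sets of predictions with the same MAE but different MSE. The numbers are made up for illustration and are not results from the notebook.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [300, 310, 320, 330]        # made-up house values, in thousands of dollars
even_errors = [310, 320, 330, 340]   # every prediction is off by 10
one_outlier = [300, 310, 320, 370]   # three exact predictions, one off by 40

print("even errors:", mae(actual, even_errors), mse(actual, even_errors))
print("one outlier:", mae(actual, one_outlier), mse(actual, one_outlier))
```

Both sets of predictions have an MAE of 10, but the set with a single large miss has an MSE of 400 against 100 for the evenly spread errors.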

After obtaining the model with the desired features and sequence length, the final code block labeled "Predicting" was run using that model in order to predict median house value for 2023-2025.



Data_Mapping.ipynb

This code utilizes the CSV dataset files containing data from 2012 to 2022 on median house value, median household income, and racial demographics for all census tracts in Mercer County, New Jersey. GeoJSON data is fetched from the TIGERWeb API for mapping.

This script has no functions.

After the GeoJSON data for the desired State and County is fetched successfully and stored, a merged GeoJSON file will need to be made with each of the .csv files that you wish to map. The code for all variables is largely the same - simply change the geojson_path and the csv_path to the appropriate file locations so that they can be stored in output_geojson_path.

The exception is if you're using a prediction dataset, which does not include the Unique_ID column. You will additionally need to type in your State FIPS and County FIPS where it says 34 and 021 in the code block that says "Future Predicted House Value Data" at the top.

After this, maps of the data are constructed in various forms.

- Characteristics over periods of time
- Characteristics divided into quantiles
- Predicted characteristics
- Change in value between years (e.g. change in median house value, in $, starting from 2012)
- Change in value as a percentage

For maps of changes in value between years, red sections signify increases in house value and blue sections signify decreases, with the color scale centered at 0. Be cautious of possible outliers, as one extremely high increase or decrease can disproportionately affect the color scale and make it much more difficult to discern smaller variations in the data.

The work-around used in the final code block of the script was to set the vmin and vmax myself, shading any values outside that range as black.
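
The following is a minimal sketch of that idea, assuming the maps are drawn with matplotlib (version 3.5 or later for `matplotlib.colormaps`). The limits of plus or minus $50,000, the "coolwarm" colormap, and the helper `color_for` are illustrative choices, not the values used in the notebook.

```python
import matplotlib
from matplotlib.colors import Normalize

# Hypothetical limits, chosen symmetric so that 0 sits at the middle of the scale.
vmin, vmax = -50_000.0, 50_000.0
norm = Normalize(vmin=vmin, vmax=vmax)

# "coolwarm" runs from blue (low) to red (high).
cmap = matplotlib.colormaps["coolwarm"].copy()
cmap.set_under("black")  # values below vmin are drawn in black
cmap.set_over("black")   # values above vmax are drawn in black

def color_for(change):
    """RGBA color for a change in value under the fixed scale."""
    return cmap(float(norm(change)))

print(color_for(40_000.0))     # reddish: a large increase within range
print(color_for(-40_000.0))    # bluish: a large decrease within range
print(color_for(1_000_000.0))  # black: outside the chosen range
```

The same `cmap` and `norm` objects could then be handed to whichever plotting call draws the map, so that in-range tracts keep their full color contrast and outliers are clearly flagged rather than stretching the scale.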
